Combining data sources#

Data can come from different files and you can combine them in a single dataframe.

When working with data, storing data across multiple files or data sources is common. Combining data from different files or data sources into a single dataframe can be useful for data analysis and modelling.

In Pandas, combining data from additional files or sources is done using the merge or join functions. These functions allow us to connect data based on a common key or index, which can be helpful when analyzing related data.

This notebook will explore how to merge or join data from different files or sources into a single Pandas dataframe. We will also discuss different types of joins and merge operations and how to handle missing or duplicate data during the merging process.

How To#

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
df_geo = df[["latitude", "longitude", "ocean_proximity"]]
df_geo.head()
latitude longitude ocean_proximity
0 37.88 -122.23 NEAR BAY
1 37.86 -122.22 NEAR BAY
2 37.85 -122.24 NEAR BAY
3 37.85 -122.25 NEAR BAY
4 37.85 -122.25 NEAR BAY
df_geo.join(df["median_house_value"])
latitude longitude ocean_proximity median_house_value
0 37.88 -122.23 NEAR BAY 452600.0
1 37.86 -122.22 NEAR BAY 358500.0
2 37.85 -122.24 NEAR BAY 352100.0
3 37.85 -122.25 NEAR BAY 341300.0
4 37.85 -122.25 NEAR BAY 342200.0
... ... ... ... ...
20635 39.48 -121.09 INLAND 78100.0
20636 39.49 -121.21 INLAND 77100.0
20637 39.43 -121.22 INLAND 92300.0
20638 39.43 -121.32 INLAND 84700.0
20639 39.37 -121.24 INLAND 89400.0

20640 rows × 4 columns

df_price = df[["longitude", "latitude", "median_house_value"]]
df_geo.merge(df_price, left_on="latitude", right_on="latitude")
latitude longitude_x ocean_proximity longitude_y median_house_value
0 37.88 -122.23 NEAR BAY -122.23 452600.0
1 37.88 -122.23 NEAR BAY -122.34 350000.0
2 37.88 -122.23 NEAR BAY -122.29 216700.0
3 37.88 -122.23 NEAR BAY -122.28 261300.0
4 37.88 -122.23 NEAR BAY -122.26 391800.0
... ... ... ... ... ...
1775707 35.96 -119.04 INLAND -119.04 51700.0
1775708 35.86 -119.46 INLAND -119.46 42700.0
1775709 35.85 -119.12 INLAND -119.12 58300.0
1775710 35.89 -119.27 INLAND -119.27 53300.0
1775711 35.88 -119.27 INLAND -119.27 43700.0

1775712 rows × 5 columns

pd.concat([df_price, df_geo], join="inner", axis=1)
longitude latitude median_house_value latitude longitude ocean_proximity
0 -122.23 37.88 452600.0 37.88 -122.23 NEAR BAY
1 -122.22 37.86 358500.0 37.86 -122.22 NEAR BAY
2 -122.24 37.85 352100.0 37.85 -122.24 NEAR BAY
3 -122.25 37.85 341300.0 37.85 -122.25 NEAR BAY
4 -122.25 37.85 342200.0 37.85 -122.25 NEAR BAY
... ... ... ... ... ... ...
20635 -121.09 39.48 78100.0 39.48 -121.09 INLAND
20636 -121.21 39.49 77100.0 39.49 -121.21 INLAND
20637 -121.22 39.43 92300.0 39.43 -121.22 INLAND
20638 -121.32 39.43 84700.0 39.43 -121.32 INLAND
20639 -121.24 39.37 89400.0 39.37 -121.24 INLAND

20640 rows × 6 columns

df_geo.append(df_geo.sample(1000))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2260/902374054.py in ?()
----> 1 df_geo.append(df_geo.sample(1000))

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, name)
   5985             and name not in self._accessors
   5986             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5987         ):
   5988             return self[name]
-> 5989         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'append'

Exercise#

Familiarize yourself with the merge operations and try how inner, left, right, and outer change the merge results.

pd.concat([df_price, df_geo], join="inner", axis=1)
longitude latitude median_house_value latitude longitude ocean_proximity
0 -122.23 37.88 452600.0 37.88 -122.23 NEAR BAY
1 -122.22 37.86 358500.0 37.86 -122.22 NEAR BAY
2 -122.24 37.85 352100.0 37.85 -122.24 NEAR BAY
3 -122.25 37.85 341300.0 37.85 -122.25 NEAR BAY
4 -122.25 37.85 342200.0 37.85 -122.25 NEAR BAY
... ... ... ... ... ... ...
20635 -121.09 39.48 78100.0 39.48 -121.09 INLAND
20636 -121.21 39.49 77100.0 39.49 -121.21 INLAND
20637 -121.22 39.43 92300.0 39.43 -121.22 INLAND
20638 -121.32 39.43 84700.0 39.43 -121.32 INLAND
20639 -121.24 39.37 89400.0 39.37 -121.24 INLAND

20640 rows × 6 columns

Additional Resources#